Authors and Affiliations:
Manuel Freire, Universidad Autónoma de Madrid, manuel.freire@uam.es [PRIMARY]
Awalin Sopan, University of Maryland, awalin@cs.umd.edu
Tool(s):
We have used AC and ManyNets to analyze the dataset. AC is a plagiarism
detection tool that is also useful for general similarity analysis, developed
at the Universidad Autónoma de Madrid by Manuel Freire. It is available at http://tangow.ii.uam.es/ac/, though we have
modified it by adding a similarity test that is specific for this
Mini-Challenge. ManyNets is a network analysis and visualization tool developed
at the Human-Computer Interaction Lab (HCIL), University of Maryland, by both
authors. More information on this tool is available at http://www.cs.umd.edu/hcil/manynets/
- source and binaries are available on demand. ManyNets uses SocialAction, also
developed at the HCIL, to display node-link diagrams.
The new test computes distance between two same-length strings as the
number of single-character differences, and can generate a file with those
distances suitable for use within a ManyNets dataset. We rate the programming
effort to add this test at around 3 hours. Before developing this test, we
experimented with AC’s default similarity tests, based on normalized
compression distance (NCD). However, since all differences seem to correspond
to single-character substitutions (we found no fragment duplications or
translocations), the extra power afforded by NCD is unnecessary. Additionally,
questions 3.3 and 3.4 directly ask for lists of mutations.
Before loading the data into AC, we used a small C program to split the
sequences into individual files within individual folders. This required around
30 minutes of time, and could have been accomplished manually using a text
processor. A similar amount of time was spent building suitable datasets from
the output of AC’s new challenge-specific test.
Video:
ANSWERS:
MC3.1: What is the region or country of origin for the
current outbreak? Please provide your answer as the name of the native
viral strain along with a brief explanation. (max 150 words)
The native strain is Nigeria_B.
We created separate files for each strain. We then
loaded them into AC, and compared all strains using a test that assigns
distance(a,b) = changed_characters(a,b), normalized between 0 and 1 to comply
with AC’s test conventions.
Figure 1: Network view from AC, showing strains as
nodes and distances as edges. The dialog window displays a side-by-side
comparison between 531 and Nigeria_B. The longest duplicate runs are
color-coded in matching colors; this would have revealed re-arrangements of
entire subsequences, but we only see single-character substitutions. The histogram-with-slider
located under the network view is used to hide edges with distance > 0.09.
Figure 2: Dendrogram view from AC. Branch x-coordinate
corresponds to the distance at which branches merge using complete linkage, and
the slider controls network visibility. Branches within the shaded area,
controlled with the slider, are visible in the network pane. Nigeria_B is close
to the mutant strains and far from other country-strains.
MC3.2:
Over time, the virus spreads and the diversity of the virus increases as it
mutates. Two patients infected with the Drafa virus are in the same
hospital as Nicolai. Nicolai has a strain identified by sequence
583. One patient has a strain identified by sequence 123 and the other
has a strain identified by sequence 51. Assume only a single viral strain
is in each patient. Which patient likely contracted the illness from
Nicolai and why? Please provide your answer as the sequence number along
with a brief explanation. (max 150 words)
The patient with strain 123 probably contracted the
disease from Nicolai. Strains 583 and 123 are very close. Strain 51 is closer
to other strains (most notably 531) than to 123 or 538.
Figure 3: From within AC, this network is centered on
583 (country strains are no longer included). Thick, red edges correspond to
smaller distances, such as between 583 and 123, with 1 change. There are 3
changes between 583 and 51, represented with thin edge; 51 can also be seen to
be similar to 531 than to any other node.
We built a ManyNets dataset from the distances
generated by AC. We filtered the resulting (complete) network to remove all
edges that do not include 583 as source.
Figure 4: All edges with source in 583 as seen in
ManyNets, sorted by number of base-substitutions, with the edges for 583-to-123
and 583-to-51 highlighted.
MC3.3:
Signs and symptoms of the Drafa virus are varied and humans react differently
to infection. Some mutant strains from the current outbreak have been
reported as being worse than others for the patients that come in contact with
them. (max 150 words)
Identify the top 3 mutations that lead to an increase
in symptom severity (a disease characteristic). The mutations involve one
or more base substitutions. For this question, the biological properties
of the underlying amino acid sequence patterns are not significant in
determining disease characteristics.
For each mutation provide the base substitutions and
their position in the sequence (left to right) where the base substitutions
occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A
at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
Taking, as a root, strain 531,
T -> C, 841 and A -> T, 954 (path from 531 to 583)
A
-> C, 160 and A -> G, 222 and T -> C, 789 (path from 531 to 952)
A
-> C, 268 and A -> T, 842 (path from 531 to 99)
Using ManyNets, and displaying the network with SocialAction,
we can interactively filter edges to remove all those that represent more than X
base-substitutions. All (14) “Severe” strains are still connected to the main
component (a tree) by single base-substitution edges.
Figure 5: Coloring the single base-substitution tree by
symptom severity. Red = Severe, Brown = Moderate, Blue = Mild. The answer for
MC3.3 describes the mutations from 531 to the three branches (= minimal base-substitution
paths) containing severe cases.
Figure 6: Mutations from 531. The central column
contains vertical dashes to represent base-substitutions from 531. Hovering
over any of these dashes describes the corresponding substitution. Selecting
the “Severe” cases and looking at the dashes allows hypotheses to be formulated.
MC3.4:
Due to the rapid spread of the virus and limited resources, medical personnel
would like to focus on treatments and quarantine procedures for the worst of
the mutant strains from the current outbreak, not just symptoms as in the
previous question. To find the most dangerous viral mutants, experts are
monitoring multiple disease characteristics.
Consider each virulence and drug resistance
characteristic as equally important. Identify the top 3 mutations that
lead to the most dangerous viral strains. The mutations involve one or more
base substitutions. In a worst case scenario, a very dangerous strain
could cause severe symptoms, have high mortality, cause major complications,
exhibit resistance to anti viral drugs, and target high risk groups. For
this question, the biological properties of the underlying amino acid sequence
patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and
their position in the sequence (left to right) where the base substitutions
occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A
at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
Using 531 as root, the 3 most dangerous strains are
A -> C, 268 and T -> C, 841 and A -> T, 945
(strain 123)
A -> C, 268 and T -> C, 526 and T -> C, 841
and A -> T, 945 (strain 118)
A -> C, 196 and T -> C, 841 and G -> C, 847
and A -> T, 945 (strain 501)
We followed the same procedure as before, but instead
of using “severity”, we built a composite column that adds up all the numerical
values of each of the tabulated disease characteristics. This is not exactly
“equal weight”, since most characteristics have 3 values (mapped to 1, 2 and
3), as compared to “complications”, which has only 2 (mapped to 1 and 2).
Mapping major “complications” to a 3 instead of a 2 would cause a 4-way tie for
“most dangerous strain”, adding 202 to the current list of 123, 118 and 501.
Figure 7: The resulting table. The last column is
formed by adding up the numerical mappings of all strain-characteristic
columns. After sorting by this “badness-column”, we have used the topmost 3
cases as our answer, looking up the mutations in the corresponding edge table.
Figure 8: The resulting one-base-substitution-per-edge
tree, encoding overall danger as node color. Red is most dangerous (most danger
points, as per the “badness” column In Figure 7), blue is least dangerous.
Figure 9: The substitutions required to mutate from
531 to the 3 most dangerous strains. These strains have been filtered from the
list of all edges from 531 to all other strains – similar to that of Figure 4.
Figure 10: On the independence of single-base
substitutions. The table contains all edges within the single-base-substitution
network seen in previous figures. Strain 770 is not present, as it is at least 2
substitutions apart from any other strain. Notice that strain 770 is not
interesting in the context of tasks 3.3 and 3.4: it is of moderate symptoms,
medium mortality, minor symptoms, and low at-risk vulnerability. There are 55
unique edges in this network (discarding reverse edges), and 1406 * 3 possible
mutations. If base-substitutions are random and independent, we expect few, if
any, duplicate base substitutions. We have highlighted the single duplicate (A
-> C, 268 is present in 2 branches) - not wholly unexpected, according to
the birthday paradox.
The tree represented in Figure 5 and Figure 8 is a
valid way to reason on the paths between strains, since the shortest path
between any pair of strains passes through the branches of that tree, with a
single exception. Figure 10 is critical in that regard: it proves that, except
in the case of 99 – 123 (where the shortest path avoids the duplicate A ->
C, 268 edge), this is indeed the case. Any other exceptions would have showed
up as duplicates. This makes the analysis considerably simpler, providing a
clear most-probable ancestry for each strain. We do not have ground truth, so
the most probable (shortest, step-by-step) paths need not correspond to actual strain
evolution; however, assuming independence of mutations and almost complete data
(there are only two small gaps, to 770 and to 49 / 961 – a small pair seen
isolated in Figure 8), our confidence in our answers is very high.
Figure 11: In this dataset, it is entirely equivalent
to look at the frequency of mutations as it is to look at the tree-paths in the
single-substitution network, as each mutation occurs in exactly one tree path
(again, with the small exception of A -> C, 268). However, it is much easier
to reason looking at the tree than looking at sequences of mutations (tree-paths).
The leftmost figure answers question 3.3 using the “multi-column overview”
feature of the latest version of ManyNets. The rightmost figure provides an
alternate, equivalent view showing the node table of the network with edges
from 531 to other strains. Annotations have been added to both figures
externally to ManyNets, but roughly correspond to the tooltips that would have
been presented when mousing over the relevant areas.
As a final observation, we have chosen strain 531 as
root for our answers to tasks 3.3 and 3.4 due to its high betweenness
centrality (making it ideal to describe mutations with a minimal number of base
substitutions and most likely evolutionary path), and its status as most
probable original outbreak strain (nearest to Nigeria_B) as described in our
answer to task 3.1.